This project aims to build a workable model to predict whether a client will subscribe to a term deposit. Key features of clients who did subscribe to a term deposit will be discovered and used for marketing: for example, which client groups are most promising in terms of age, job, previous relationship with the bank, etc.
The project works on the original data set with 45211 observations and 17 attributes. The drawback of using the original data set is that the class label is unbalanced, with far more “no” than “yes” observations. This gives a classification model good overall accuracy but a poor rate of capturing the “yes” class. Thus, in this project, the “yes” observations of the “y” attribute will be replicated so that their number approximately equals the number of “no” observations.
Input variables:
Bank client data:
1 - age (numeric)
2 - job: type of job (categorical: “admin.”, “unknown”, “unemployed”, “management”, “housemaid”, “entrepreneur”, “student”, “blue-collar”, “self-employed”, “retired”, “technician”, “services”)
3 - marital: marital status (categorical: “married”, “divorced”, “single”; note: “divorced” means divorced or widowed)
4 - education (categorical: “unknown”, “secondary”, “primary”, “tertiary”)
5 - default: has credit in default? (binary: “yes”, “no”)
6 - balance: average yearly balance, in euros (numeric)
7 - housing: has housing loan? (binary: “yes”, “no”)
8 - loan: has personal loan? (binary: “yes”, “no”)
Related to the last contact of the current campaign:
9 - contact: contact communication type (categorical: “unknown”, “telephone”, “cellular”)
10 - day: last contact day of the month (numeric)
11 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
12 - duration: last contact duration, in seconds (numeric)
Other attributes:
13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means the client was not previously contacted)
15 - previous: number of contacts performed before this campaign and for this client (numeric)
16 - poutcome: outcome of the previous marketing campaign (categorical: “unknown”, “other”, “failure”, “success”)
Output variable (desired target):
17 - y: has the client subscribed to a term deposit? (binary: “yes”, “no”)
For the source and attribute information of the data set, please refer to https://archive.ics.uci.edu/ml/datasets/bank+marketing
Import the data set
library(tidyverse)
bank <- read_csv2("/home/junyan26/DATA/Bank_Marketing/bank/bank-full.csv", col_types = cols())
Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
nrow(bank)
[1] 45211
ncol(bank)
[1] 17
print(head(bank))
sapply(bank, function(x) sum(is.na(x)))
age job marital education default balance housing loan contact day month
0 0 0 0 0 0 0 0 0 0 0
duration campaign pdays previous poutcome y
0 0 0 0 0 0
There is no missing data in the data set.
Replicate the “yes” observations in the “y” attribute.
library(splitstackshape)
bank <- as.data.frame((bank))
num_no <- nrow(bank[bank$y == "no", ])
num_yes <- nrow(bank[bank$y == "yes", ])
rep_num <- round(num_no/num_yes)
new_yes_rows <- expandRows(bank[bank$y == "yes", ], count = rep_num, count.is.col = FALSE)
print(nrow(new_yes_rows))
[1] 42312
print(num_no)
[1] 39922
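As a sanity check, the replication factor can be reproduced from the class counts above (a minimal base-R sketch; the counts are hard-coded from this data set):

```r
# Class counts taken from the original data set (45211 rows total).
num_no  <- 39922                    # "no" rows
num_yes <- 45211 - num_no           # 5289 "yes" rows
rep_num <- round(num_no / num_yes)  # 39922 / 5289 is about 7.55, rounds to 8
num_yes * rep_num                   # 42312 replicated "yes" rows, as printed above
```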
n_yes_no <- data.frame(c(nrow(new_yes_rows), num_no))
n_yes_no$class <- c("yes", "no")
colnames(n_yes_no) <- c("count", "class")
n_yes_no <- n_yes_no[, c(2, 1)]
n_yes_no
library(ggplot2)
ggplot(data=n_yes_no, aes(x=reorder(class, -count), y=count, fill=reorder(class, -count))) +
geom_bar(stat="identity", width=0.5)+
theme_classic() + ylab("Count") + xlab("classes in target variable \"y\"") +
ggtitle("Compare the Number of the Classes (After Over Sampling)") +
theme(plot.title = element_text(hjust = 0.5, size=14, face="bold"),
axis.text.x = element_text(hjust = 0.5, face="bold", size=14))+
guides(fill=FALSE, color=FALSE)
After the replication, the data is more balanced. The number of “yes” rows is 42312 and the number of “no” rows is 39922.
bank <- rbind(bank[bank$y == "no", ], new_yes_rows)
set.seed(321)
bank <- bank[sample(nrow(bank)),]
nrow(bank)
[1] 82234
ncol(bank)
[1] 17
The new data set has 82234 rows and 17 columns.
summary(bank)
age job marital education default balance
Min. :18.00 Length:82234 Length:82234 Length:82234 Length:82234 Min. : -8019
1st Qu.:32.00 Class :character Class :character Class :character Class :character 1st Qu.: 127
Median :39.00 Mode :character Mode :character Mode :character Mode :character Median : 563
Mean :41.27 Mean : 1561
3rd Qu.:49.00 3rd Qu.: 1757
Max. :95.00 Max. :102127
housing loan contact day month duration
Length:82234 Length:82234 Length:82234 Min. : 1.00 Length:82234 Min. : 0.0
Class :character Class :character Class :character 1st Qu.: 8.00 Class :character 1st Qu.: 145.0
Mode :character Mode :character Mode :character Median :15.00 Mode :character Median : 263.0
Mean :15.51 Mean : 383.8
3rd Qu.:21.00 3rd Qu.: 515.0
Max. :31.00 Max. :4918.0
campaign pdays previous poutcome y
Min. : 1.000 Min. : -1.00 Min. : 0.000 Length:82234 Length:82234
1st Qu.: 1.000 1st Qu.: -1.00 1st Qu.: 0.000 Class :character Class :character
Median : 2.000 Median : -1.00 Median : 0.000 Mode :character Mode :character
Mean : 2.483 Mean : 53.03 Mean : 0.846
3rd Qu.: 3.000 3rd Qu.: 75.00 3rd Qu.: 1.000
Max. :63.000 Max. :871.00 Max. :275.000
Turn the “character” class variables into “factor”.
sapply(bank, class)
age job marital education default balance housing loan contact
"numeric" "character" "character" "character" "character" "numeric" "character" "character" "character"
day month duration campaign pdays previous poutcome y
"numeric" "character" "numeric" "numeric" "numeric" "numeric" "character" "character"
c <- colnames(dplyr::select_if(bank, is.character))
bank[c] <- lapply(bank[c], factor)
sapply(bank, class)
age job marital education default balance housing loan contact day month
"numeric" "factor" "factor" "factor" "factor" "numeric" "factor" "factor" "factor" "numeric" "factor"
duration campaign pdays previous poutcome y
"numeric" "numeric" "numeric" "numeric" "factor" "factor"
Turn categorical variables into numerical.
bank_numeric <- bank
sapply(bank_numeric, class)
age job marital education default balance housing loan contact day month
"numeric" "factor" "factor" "factor" "factor" "numeric" "factor" "factor" "factor" "numeric" "factor"
duration campaign pdays previous poutcome y
"numeric" "numeric" "numeric" "numeric" "factor" "factor"
must_convert <- sapply(bank_numeric,is.factor)
bank_numeric_temp <- sapply(bank_numeric[,must_convert],unclass)
bank_numeric <- cbind(bank_numeric[,!must_convert],bank_numeric_temp)
#Reorder the column names. Make them in sync with bank.
cnames = colnames(bank)
bank_numeric <- bank_numeric[, cnames]
head(bank_numeric)
“bank_numeric” is a data set generated from “bank”, with all data numeric.
Examine the correlation and significance level (p-value) between variables.
library(Hmisc)
Loading required package: survival
Attaching package: ‘survival’
The following object is masked from ‘package:caret’:
cluster
Loading required package: Formula
Attaching package: ‘Hmisc’
The following objects are masked from ‘package:dplyr’:
src, summarize
The following objects are masked from ‘package:base’:
format.pval, units
options(scipen=999)
cor_p <- rcorr(x = data.matrix(bank_numeric)[, 1:16], y = bank_numeric[, 17], type = c("pearson", "spearman"))
Correlation Matrix:
round(cor_p$r, 2)
age job marital education default balance housing loan contact day month duration campaign
age 1.00 -0.03 -0.45 -0.13 -0.02 0.11 -0.18 -0.03 0.03 0.00 -0.03 -0.01 -0.01
job -0.03 1.00 0.08 0.15 -0.01 0.02 -0.14 -0.05 -0.09 0.02 -0.08 0.00 -0.01
marital -0.45 0.08 1.00 0.13 -0.01 0.00 -0.03 -0.05 -0.06 0.00 -0.01 0.00 -0.02
education -0.13 0.15 0.13 1.00 -0.01 0.07 -0.11 -0.06 -0.12 0.01 -0.05 -0.02 -0.01
default -0.02 -0.01 -0.01 -0.01 1.00 -0.06 0.02 0.08 0.03 0.01 0.00 0.00 0.02
balance 0.11 0.02 0.00 0.07 -0.06 1.00 -0.08 -0.09 -0.03 0.01 0.01 0.01 -0.02
housing -0.18 -0.14 -0.03 -0.11 0.02 -0.08 1.00 0.09 0.22 -0.01 0.21 0.04 0.02
loan -0.03 -0.05 -0.05 -0.06 0.08 -0.09 0.09 1.00 0.03 0.01 0.03 0.01 0.03
contact 0.03 -0.09 -0.06 -0.12 0.03 -0.03 0.22 0.03 1.00 0.00 0.28 -0.01 0.06
day 0.00 0.02 0.00 0.01 0.01 0.01 -0.01 0.01 0.00 1.00 -0.02 -0.01 0.13
month -0.03 -0.08 -0.01 -0.05 0.00 0.01 0.21 0.03 0.28 -0.02 1.00 0.00 -0.09
duration -0.01 0.00 0.00 -0.02 0.00 0.01 0.04 0.01 -0.01 -0.01 0.00 1.00 -0.03
campaign -0.01 -0.01 -0.02 -0.01 0.02 -0.02 0.02 0.03 0.06 0.13 -0.09 -0.03 1.00
pdays 0.00 0.00 0.02 0.02 -0.03 0.02 0.07 -0.04 -0.23 -0.06 0.03 -0.04 -0.10
previous 0.02 0.02 0.02 0.03 -0.03 0.03 0.00 -0.03 -0.16 -0.05 0.03 -0.03 -0.05
poutcome -0.01 -0.01 -0.02 -0.04 0.04 -0.03 -0.05 0.04 0.25 0.06 -0.04 0.05 0.11
y 0.03 0.06 0.07 0.10 -0.04 0.08 -0.22 -0.12 -0.26 -0.04 -0.04 0.45 -0.13
pdays previous poutcome y
age 0.00 0.02 -0.01 0.03
job 0.00 0.02 -0.01 0.06
marital 0.02 0.02 -0.02 0.07
education 0.02 0.03 -0.04 0.10
default -0.03 -0.03 0.04 -0.04
balance 0.02 0.03 -0.03 0.08
housing 0.07 0.00 -0.05 -0.22
loan -0.04 -0.03 0.04 -0.12
contact -0.23 -0.16 0.25 -0.26
day -0.06 -0.05 0.06 -0.04
month 0.03 0.03 -0.04 -0.04
duration -0.04 -0.03 0.05 0.45
campaign -0.10 -0.05 0.11 -0.13
pdays 1.00 0.46 -0.80 0.15
previous 0.46 1.00 -0.51 0.14
poutcome -0.80 -0.51 1.00 -0.12
y 0.15 0.14 -0.12 1.00
round(cor_p$P, 2)
age job marital education default balance housing loan contact day month duration campaign pdays
age NA 0.00 0.00 0.00 0.00 0.00 0.00 0 0.00 0.59 0.00 0.12 0.02 0.21
job 0.00 NA 0.00 0.00 0.00 0.00 0.00 0 0.00 0.00 0.00 0.87 0.02 0.88
marital 0.00 0.00 NA 0.00 0.01 0.44 0.00 0 0.00 0.46 0.01 0.71 0.00 0.00
education 0.00 0.00 0.00 NA 0.00 0.00 0.00 0 0.00 0.02 0.00 0.00 0.02 0.00
default 0.00 0.00 0.01 0.00 NA 0.00 0.00 0 0.00 0.05 0.34 0.25 0.00 0.00
balance 0.00 0.00 0.44 0.00 0.00 NA 0.00 0 0.00 0.07 0.02 0.00 0.00 0.00
housing 0.00 0.00 0.00 0.00 0.00 0.00 NA 0 0.00 0.02 0.00 0.00 0.00 0.00
loan 0.00 0.00 0.00 0.00 0.00 0.00 0.00 NA 0.00 0.00 0.00 0.00 0.00 0.00
contact 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 NA 0.64 0.00 0.11 0.00 0.00
day 0.59 0.00 0.46 0.02 0.05 0.07 0.02 0 0.64 NA 0.00 0.04 0.00 0.00
month 0.00 0.00 0.01 0.00 0.34 0.02 0.00 0 0.00 0.00 NA 0.68 0.00 0.00
duration 0.12 0.87 0.71 0.00 0.25 0.00 0.00 0 0.11 0.04 0.68 NA 0.00 0.00
campaign 0.02 0.02 0.00 0.02 0.00 0.00 0.00 0 0.00 0.00 0.00 0.00 NA 0.00
pdays 0.21 0.88 0.00 0.00 0.00 0.00 0.00 0 0.00 0.00 0.00 0.00 0.00 NA
previous 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0 0.00 0.00 0.00 0.00 0.00 0.00
poutcome 0.05 0.10 0.00 0.00 0.00 0.00 0.00 0 0.00 0.00 0.00 0.00 0.00 0.00
y 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.00 0.00 0.00 0.00 0.00 0.00
previous poutcome y
age 0.00 0.05 0
job 0.00 0.10 0
marital 0.00 0.00 0
education 0.00 0.00 0
default 0.00 0.00 0
balance 0.00 0.00 0
housing 0.86 0.00 0
loan 0.00 0.00 0
contact 0.00 0.00 0
day 0.00 0.00 0
month 0.00 0.00 0
duration 0.00 0.00 0
campaign 0.00 0.00 0
pdays 0.00 0.00 0
previous NA 0.00 0
poutcome 0.00 NA 0
y 0.00 0.00 NA
#install.packages("corrplot")
library(corrplot)
corrplot 0.84 loaded
M <- cor_p$r
p_mat <- cor_p$P
corrplot(M, type = "upper",col = heat.colors(100),bg = "darkblue",
p.mat = p_mat, sig.level = 0.05)
The correlation matrix uses Pearson correlation and Spearman’s rank correlation; the former is designed for ratio and interval data, while the latter works for ordinal, ratio, and interval data. It is not very accurate here because not all the variables are of these data types. Still, it can serve as a reference for how the variables appear to be correlated.
In the majority of analyses, an alpha of 0.05 is used as the cutoff for significance. If the p-value is less than 0.05, we reject the null hypothesis that there is no correlation between the variables and conclude that a significant correlation does exist. If the p-value is larger than 0.05, we CANNOT reject the null hypothesis and CANNOT conclude that a significant correlation exists.
The crosses in the plot indicate p-values higher than 0.05, meaning we cannot conclude that the correlation between those variables is significant. For the rest, we can conclude that the correlation is significant.
Thus, all variables have a significant correlation with “y”, although the correlation coefficients are not high.
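The decision rule above can be illustrated with `cor.test()` on a self-contained toy pair (the variables here are made up for the example):

```r
# A genuinely correlated toy pair: y2 depends linearly on x plus noise.
set.seed(1)
x  <- rnorm(200)
y2 <- x + rnorm(200)
ct <- cor.test(x, y2)   # Pearson correlation test
ct$p.value < 0.05       # TRUE here: reject the null of no correlation
```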
The correlation between “pdays” and “poutcome” is strongly negative, which needs a closer look later.
Scatterplot Matrix
The points in a scatterplot matrix can be colored by the class label in classification problems. This can help to spot clear (or unclear) separation of classes and perhaps give an idea of how difficult the problem may be.
pairs(y~., data = bank, col = bank$y, lower.panel = NULL)
The pair plot shows how the variable pairs separate the “y” classes. Overall, the separation is not very clear. Categorical and numerical variables are both included, which makes the plot unnecessarily large.
Levels of the factor variables:
sapply(dplyr::select_if(bank, is.factor), levels)
$job
[1] "admin." "blue-collar" "entrepreneur" "housemaid" "management" "retired"
[7] "self-employed" "services" "student" "technician" "unemployed" "unknown"
$marital
[1] "divorced" "married" "single"
$education
[1] "primary" "secondary" "tertiary" "unknown"
$default
[1] "no" "yes"
$housing
[1] "no" "yes"
$loan
[1] "no" "yes"
$contact
[1] "cellular" "telephone" "unknown"
$month
[1] "apr" "aug" "dec" "feb" "jan" "jul" "jun" "mar" "may" "nov" "oct" "sep"
$poutcome
[1] "failure" "other" "success" "unknown"
$y
[1] "no" "yes"
Summary of numeric variable:
sapply(dplyr::select_if(bank, is.numeric), summary)
age balance day duration campaign pdays previous
Min. 18.00000 -8019.000 1.0000 0.0000 1.00000 -1.00000 0.000000
1st Qu. 32.00000 127.000 8.0000 145.0000 1.00000 -1.00000 0.000000
Median 39.00000 563.000 15.0000 263.0000 2.00000 -1.00000 0.000000
Mean 41.26661 1561.265 15.5146 383.8323 2.48345 53.03128 0.845964
3rd Qu. 49.00000 1757.000 21.0000 515.0000 3.00000 75.00000 1.000000
Max. 95.00000 102127.000 31.0000 4918.0000 63.00000 871.00000 275.000000
From the summary, “age” seems normally distributed. “balance” is very widely spread. “day” by itself doesn’t carry much meaning and may be combined with “month”. “duration”, “campaign”, “pdays” and “previous” appear skewed.
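The skewness claim can be checked numerically. Below is a small moment-based skewness helper (base R only, no extra packages); applying it to columns such as `bank$duration` or `bank$campaign` should give clearly positive values:

```r
# Sample skewness: the third standardized moment of a numeric vector.
skew <- function(v) mean(((v - mean(v)) / sd(v))^3)

# Demonstration on a known right-skewed distribution (exponential):
set.seed(2)
skew(rexp(10000))      # clearly positive (around 2 for an exponential)
# skew(bank$duration)  # would quantify the skew seen in the summary above
```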
Visualization of the Data Set
ggplot(bank, aes(x=age, fill=y, color=y))+
geom_histogram(position="identity", alpha=0.5, binwidth=2)+
stat_function(fun = function(bank, mean, sd, n){
n * dnorm(x = bank, mean = mean, sd = sd) },
args = with(bank, c(mean = mean(age), sd = sd(age), n
= length(age))), col="blue") +
scale_color_manual(values=c("lightblue", "coral"))+
scale_fill_manual(values=c("lightblue", "coral")) +
theme_classic() +
ggtitle("Age Distribution Colored by Y") +
theme(plot.title = element_text(hjust = 0.5, size=14, face="bold"))
The red and blue areas separate a little, but overall they overlap. Age is roughly normally distributed.
spineplot(x = bank$age, y = bank$y, xlab = "age", ylab = "y", breaks=20,
main = "Age vs Y", col = c("lightblue", "coral"))
More than 80% of clients aged over 60 or under 18 subscribed to a term deposit, but these two groups contain only a small number of people.
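The claim above can be quantified with a small helper that computes the subscription rate per age band (a hedged sketch; `rate_by_band` is a name introduced here, and the commented call assumes the `bank` data frame loaded earlier):

```r
# Subscription rate ("yes" proportion) within coarse age bands.
rate_by_band <- function(age, y, breaks = c(17, 25, 60, 100)) {
  band <- cut(age, breaks = breaks)
  tapply(y == "yes", band, mean)
}
# rate_by_band(bank$age, bank$y)  # rates for (17,25], (25,60], (60,100]
```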
ggplot(bank, aes(x=balance, fill=y, color=y))+
geom_histogram(position="identity", alpha=0.5, binwidth=2000)+
scale_color_manual(values=c("lightblue", "coral"))+
scale_fill_manual(values=c("lightblue", "coral")) +
theme_classic() +
ggtitle("Balance Distribution Colored by Y") +
theme(plot.title = element_text(hjust = 0.5, size=14, face="bold"))
“balance” is not normally distributed. The two colors show a little separation.
spineplot(x = bank$balance, y = bank$y, xlab = "balance", ylab = "y", breaks=100,
main = "balance vs y", col = c("lightblue", "coral"))
spineplot(x = bank[bank$balance > 25000, "balance"], y = bank[bank$balance > 25000, "y"], xlab = "balance", ylab = "y", breaks=100,
main = "balance (>25000) vs y", col = c("lightblue", "coral"))
Balances larger than 25000 have a higher proportion of “yes”.
bank[bank$balance > 65000, c("balance", "y")]
spineplot(x = bank[bank$balance < -600, "balance"], y = bank[bank$balance < -600, "y"], xlab = "balance", ylab = "y", breaks=100,
main = "balance (<0) vs y", col = c("lightblue", "coral"))
It seems that for balances above 25000, the higher the balance, the higher the proportion of clients who subscribe to a term deposit, but this is a very small group. For negative balances, fewer clients subscribe.
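To put numbers on these tail observations, the subscription rate for each slice can be computed directly (a sketch; the commented calls assume `bank`):

```r
# Proportion of "yes" in a vector of class labels.
sub_rate <- function(y) mean(y == "yes")
# sub_rate(bank$y[bank$balance > 25000])  # high-balance clients
# sub_rate(bank$y[bank$balance < 0])      # negative-balance clients
# sub_rate(bank$y)                        # overall baseline for comparison
```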
Plot age, balance and y
#library(ggplot2)
library(ggpubr)
Loading required package: magrittr
Attaching package: ‘magrittr’
The following object is masked from ‘package:purrr’:
set_names
The following object is masked from ‘package:tidyr’:
extract
# Grouped Scatter plot with marginal density plots
ggscatterhist(
bank[, c("age", "balance", "y")], x = "age", y = "balance",
color = "y", size = 3, alpha = 0.6,
palette = c("lightblue", "coral"), legend = "right")
The two colors are almost completely mixed. “age” and “balance” together don’t separate the “yes” and “no” classes of “y” well.
ggplot(bank, aes(x=duration, fill=y, color=y))+
geom_histogram(position="identity", alpha=0.5, binwidth=100)+
scale_color_manual(values=c("lightblue", "coral"))+
scale_fill_manual(values=c("lightblue", "coral")) +
theme_classic() +
ggtitle("Duration Distribution Colored by Y") +
theme(plot.title = element_text(hjust = 0.5, size=14, face="bold"))+
xlab("duration: last contact duration, in seconds")
The two colors show an obvious separation. “duration” separates the two categories of “y” pretty well.
spineplot(x = bank$duration, y = bank$y, xlab = "duration: last contact duration, in seconds", ylab = "y", breaks=100,
main = "Duration vs Y", col = c("lightblue", "coral"))
“duration” and “y” are pretty strongly associated: the longer the duration, the larger the proportion of clients who subscribe to a term deposit.
summary(bank$duration)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 145.0 263.0 383.8 515.0 4918.0
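The trend in the spine plot can be summarized as a subscription rate per duration quartile (a sketch; `rate_by_quartile` is a helper introduced here, and the commented call assumes `bank`):

```r
# Subscription rate within quartile bins of a numeric predictor.
rate_by_quartile <- function(x, y) {
  q <- cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)),
           include.lowest = TRUE)
  tapply(y == "yes", q, mean)
}
# rate_by_quartile(bank$duration, bank$y)  # should rise across quartiles
```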
ggplot(bank, aes(x=campaign, fill=y, color=y))+
geom_histogram(position="identity", alpha=0.5, binwidth=1)+
scale_color_manual(values=c("lightblue", "coral"))+
scale_fill_manual(values=c("lightblue", "coral")) +
theme_classic() +
ggtitle("Campaign Distribution Colored by Y") +
theme(plot.title = element_text(hjust = 0.5, size=14, face="bold")) +
xlab("campaign: number of contacts performed during this campaign")
The two colors overlap in the middle and show some separation at the two ends.
summary(bank$campaign)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 2.000 2.483 3.000 63.000
Unique values in “campaign”:
sort(unique(bank$campaign))
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
[36] 36 37 38 39 41 43 44 46 50 51 55 58 63
aggregate(data.frame(count = bank$campaign), list(value = bank$campaign), length)
Most values of “campaign” are 1 or 2, i.e., most clients were contacted once or twice during this campaign.
spineplot(x = bank$campaign, y = bank$y, xlab = "campaign: number of contacts performed during this campaign", ylab = "y", breaks=50,
main = "Campaign vs Y", col = c("lightblue", "coral"))
There seems to be some trend.
df_cam <- bank[bank$campaign > 3, ]
spineplot(x = df_cam$campaign, y = df_cam$y, xlab = "campaign: number of contacts performed during this campaign", ylab = "y", breaks=50, main = "Campaign (>3) vs Y", col = c("lightblue", "coral"))
There is a trend that the more campaign contacts, the lower the percentage of clients who subscribe to a term deposit, especially for campaign values above 3.
library(ggpubr)
# Grouped Scatter plot with marginal density plots
ggscatterhist(
bank[, c("campaign", "duration", "y")], x = "campaign", y = "duration",
color = "y", size = 3, alpha = 0.6,
palette = c("lightblue", "coral"), legend = "right")+
ggtitle("Campaign and Duration Colored by Y") +
theme(plot.title = element_text(hjust = 0.5, size=14, face="bold"))
The two colors separate, maybe because “duration” is a strong predictor. But “campaign” should be kept as well.
summary(bank$previous)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 0.000 0.846 1.000 275.000
Unique values in “previous”:
sort(unique(bank$previous))
[1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[27] 26 27 28 29 30 32 35 37 38 40 41 51 55 58 275
aggregate(data.frame(count = bank$previous), list(value = bank$previous), length)
library(ggplot2)
ggplot(bank, aes(x=previous, fill=y, color=y))+
geom_histogram(position="identity", alpha=0.5, binwidth=10)+
scale_color_manual(values=c("lightblue", "coral"))+
scale_fill_manual(values=c("lightblue", "coral")) +
theme_classic() +
ggtitle("Previous Distribution Colored by Y") +
theme(plot.title = element_text(hjust = 0.5, size=14, face="bold")) +
xlab("previous: number of contacts performed before this campaign")
The distribution of “previous” is very unbalanced. Remove previous > 10 and plot again.
library(ggplot2)
df_prev <- bank[bank$previous < 10,]
ggplot(df_prev, aes(x=previous, fill=y, color=y))+
geom_histogram(position="identity", alpha=0.5, binwidth=1)+
scale_color_manual(values=c("lightblue", "coral"))+
scale_fill_manual(values=c("lightblue", "coral")) +
theme_classic() +
ggtitle("Previous ( < 10) Distribution Colored by Y") +
theme(plot.title = element_text(hjust = 0.5, size=14, face="bold"))+
xlab("previous: number of contacts performed before this campaign")
The two colors show some sort of separation.
spineplot(x = bank$previous, y = bank$y, xlab = "previous: number of contacts performed before this campaign", ylab = "y", breaks=200, main = "Previous vs Y", col = c("lightblue", "coral"))
Fewer than 50% of the clients with no previous contact subscribe to a term deposit, and most clients have no previous contact. More previous contacts seem to lead to a higher percentage of subscribers. Maybe “previous” should be converted into a boolean: 0 = FALSE, >0 = TRUE.
tem <- bank
tem$previous1 <- tem$previous > 0
tem$previous1 <- as.factor(tem$previous1)
spineplot(x = tem$previous1, y = tem$y, xlab = "Previous (bool)", ylab = "y",
main = "Previous (bool) vs Y", col = c("lightblue", "coral"), xaxlabels = levels(tem$previous1))
Now the variable makes more sense.
library(ggplot2)
ggplot(bank, aes(x=pdays, fill=y, color=y))+
geom_histogram(position="identity", alpha=0.5, binwidth=100)+
scale_color_manual(values=c("lightblue", "coral"))+
scale_fill_manual(values=c("lightblue", "coral")) +
theme_classic() +
ggtitle("Pdays Distribution Colored by Y") +
theme(plot.title = element_text(hjust = 0.5, size=14, face="bold"))+
xlab("pdays: number of days that passed by after the client was \nlast contacted from a previous campaign (-1 means client was not previously contacted)")
“pdays” is mostly -1 (-1 means the client was not previously contacted), so this variable should overlap heavily with the “previous” variable.
library(ggplot2)
df_pdays <- bank[bank$pdays == -1,]
ggplot(df_pdays, aes(x=pdays, fill=y, color=y))+
geom_histogram(position="identity", alpha=0.3, binwidth=0.5)+
scale_color_manual(values=c("lightblue", "coral"))+
scale_fill_manual(values=c("lightblue", "coral")) +
theme_classic() +
ggtitle("Pdays (=-1) Distribution Colored by Y") +
theme(plot.title = element_text(hjust = 0.5, size=14, face="bold"))+
xlab("pdays: number of days that passed by after the client was \nlast contacted from a previous campaign (-1 means client was not previously contacted)")
spineplot(x = bank$pdays, y = bank$y, xlab = "pdays", ylab = "y", breaks=1000, main = "Pdays vs Y", col = c("lightblue", "coral"))
df_pdays <- bank[bank$pdays >0, ]
spineplot(x = df_pdays$pdays, y = df_pdays$y, xlab = "pdays", ylab = "y", breaks=100, main = "Pdays (>0) vs Y", col = c("lightblue", "coral"))
bank[bank$pdays>0,]
Most clients were not contacted previously, and fewer than half of them subscribed to a term deposit. For those who were contacted, pdays of 80-100 and 180-190 have the highest subscription rates (more than 80%).
Convert “pdays” into a boolean
tem$pdays1 <- tem$pdays > 0
tem$pdays1 <- as.factor(tem$pdays1)
spineplot(x = tem$pdays1, y = tem$y, xlab = "Pdays (bool)", ylab = "y",
main = "Pdays (bool) vs Y", col = c("lightblue", "coral"), xaxlabels = levels(tem$pdays1))
Compare previous and pdays
tem$previous1 <- as.factor(tem$previous1)
spineplot(x = tem$pdays1, y = tem$previous1, xlab = "Pdays (bool)", ylab = "Previous (bool)",
main = "Pdays (bool) vs Previous (bool)", col = c("lightblue", "coral"), xaxlabels = levels(tem$pdays1))
pdays (bool) and previous (bool) overlap completely. Since pdays is strongly associated with previous and poutcome (shown later), it should be dropped when performing classification.
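The claimed overlap can be verified with a cross-tabulation of the two boolean encodings (a sketch; `overlap_table` is a helper introduced here, and the commented call assumes `bank`):

```r
# If pdays == -1 exactly when previous == 0, both off-diagonal cells are 0.
overlap_table <- function(pdays, previous) {
  table(contacted_before = pdays > -1, has_previous = previous > 0)
}
# overlap_table(bank$pdays, bank$previous)
```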
Combine “campaign” and “previous” into a new variable named “contact_times”
tem$contact_times <- tem$campaign + tem$previous
library(ggplot2)
ggplot(tem, aes(x=contact_times, fill=y, color=y))+
geom_histogram(position="identity", alpha=0.5, binwidth=5)+
scale_color_manual(values=c("lightblue", "coral"))+
scale_fill_manual(values=c("lightblue", "coral")) +
theme_classic() +
ggtitle("Contact_Times (Campaign + Previous) Distribution Colored by Y") +
theme(plot.title = element_text(hjust = 0.5, size=14, face="bold"))+
xlab("contact_times: the total number of contacts \nbefore and during this campaign")
spineplot(x = tem$contact_times, y = tem$y, xlab = "contact_times: the total number of contacts \nbefore and during this campaign", ylab = "y", breaks=200,
main = "Contact_Times (Campaign + Previous) vs Y", col = c("lightblue", "coral"))
The contact_times variable shows less power to split “yes” and “no” in the “y” variable, so it should not be used.
Plot month:
ggplot(bank, aes(x=month, fill=y, color=y))+
geom_histogram(position="identity", alpha=0.5, stat="count")+
scale_color_manual(values=c("lightblue", "coral"))+
scale_fill_manual(values=c("lightblue", "coral")) +
theme_classic() +
ggtitle("Month Distribution Colored by Y") +
theme(plot.title = element_text(hjust = 0.5, size=14, face="bold"))
Ignoring unknown parameters: binwidth, bins, pad
The two colors show some sort of separation.
Plot day of month:
ggplot(bank, aes(x=day, fill=y, color=y))+
geom_histogram(position="identity", alpha=0.5, stat="count")+
scale_color_manual(values=c("lightblue", "coral"))+
scale_fill_manual(values=c("lightblue", "coral")) +
theme_classic() +
ggtitle("Day Distribution Colored by Y") +
theme(plot.title = element_text(hjust = 0.5, size=14, face="bold"))
Ignoring unknown parameters: binwidth, bins, pad
The two colors show some separation.
Combine day and month
summary(bank[, c("day", "month")])
day month
Min. : 1.00 may :20241
1st Qu.: 8.00 jul :11284
Median :15.00 aug :11063
Mean :15.51 jun : 9163
3rd Qu.:21.00 apr : 6971
Max. :31.00 nov : 6791
(Other):16721
library(magrittr)
library(lubridate)
Attaching package: ‘lubridate’
The following object is masked from ‘package:base’:
date
mo2Num <- function(x) match(tolower(x), tolower(month.abb))
month <- as.double(mo2Num(bank$month))
day <- as.double(bank$day)
m_d <- paste("2018", month, day, sep="-") %>% ymd() %>% as.Date()
m_d <- setNames(data.frame(m_d), "date")
head(m_d)
tem <- cbind(tem, m_d)
#remove the year
tem$date <- format(tem$date, format="%m-%d")
tem <- tem[, -19]
head(tem)
Plot date (day and month)
length(unique(tem$date))
[1] 318
Group “y” by each unique date
library(dplyr)
date_y <- tem %>%
group_by(date, y) %>%
summarise(n = n())
head(date_y)
date_y <- data.frame(date_y)
date_y_yes <- date_y[date_y$y=="yes", c("date", "n")]
rownames(date_y_yes) <- date_y_yes$date
head(date_y_yes)
date_y_no <- date_y[date_y$y=="no", c("date", "n")]
rownames(date_y_no) <- date_y_no$date
head(date_y_no)
ggplot(data=date_y, aes(x=date, y= n, group = y, fill=y, color=y)) + geom_line()+
scale_color_manual(values=c("lightblue", "coral"))+
scale_fill_manual(values=c("lightblue", "coral")) +
theme_classic() +
ggtitle("Date Distribution Colored by Y")+
theme(plot.title = element_text(hjust = 0.5, size=14, face="bold"), axis.text.x= element_blank()) +
ylab("count")
The two colors separate a little, but overall it looks about the same as month alone. So it would be better to keep “month” and “day” separate.
ggplot(bank, aes(x=poutcome, fill=y, color=y))+
geom_histogram(position="identity", alpha=0.5, stat="count")+
scale_color_manual(values=c("lightblue", "coral"))+
scale_fill_manual(values=c("lightblue", "coral")) +
theme_classic() +
ggtitle("Poutcome Distribution Colored by Y") +
theme(plot.title = element_text(hjust = 0.5, size=14, face="bold"))
Ignoring unknown parameters: binwidth, bins, pad
“success” has far more “yes” than “no”.
levels(bank$poutcome)
[1] "failure" "other" "success" "unknown"
From the pair plot, “pdays” and “poutcome” together can identify most of one category of “y” but can barely identify the other.
par(las = 2, cex.axis = 0.75, mar = c(7, 4.1, 4.1, 2.1))
spineplot(x = bank$poutcome, y = bank$y, xlab = "poutcome", ylab = "y",
main = "poutcome vs y", col = c("lightblue", "coral"))
If a client subscribed to a term deposit in a previous campaign (poutcome = “success”), he or she is very likely to subscribe in this campaign. “poutcome” seems to be a strong predictor.
Plot pdays and poutcome
par(las = 1, cex.axis = 0.75, mar = c(7, 4.1, 4.1, 2.1))
spineplot(x = bank$pdays, y = bank$poutcome, xlab = "pdays", ylab = "poutcome", breaks=1000,
main = "pdays vs poutcome", col = c("lightblue", "grey","coral", "white"))
dfpp <- bank[bank$pdays >-1, ]
par(las = 1, cex.axis = 0.75, mar = c(7, 4.1, 4.1, 2.1))
spineplot(x = dfpp$pdays, y = dfpp$poutcome, xlab = "pdays", ylab = "poutcome", breaks=2000,
main = "pdays vs poutcome", col = c("lightblue", "grey","coral", "white"))
All poutcome values are “unknown” when pdays = -1, meaning the client was not contacted in a previous campaign.
Most of the previously contacted clients with pdays of 88-93 and 176-186 subscribed to a term deposit.
“pdays” seems to be strongly associated with “poutcome”, so it may be removed.
par(las = 2, cex.axis = 0.75, mar = c(7, 4.1, 4.1, 2.1))
spineplot(x = bank$job, y = bank$y, xlab = "job", ylab = "y",
main = "Job vs Y", col = c("lightblue", "coral"), xaxlabels = levels(bank$job))
Overall, “job” shows some difference between “yes” and “no” among its categories. It may be a good predictor.
spineplot(x = bank$marital, y = bank$y, xlab = "marital", ylab = "y",
main = "Marital vs Y", col = c("lightblue", "coral"), xaxlabels = levels(bank$marital))
All marital categories have roughly equal proportions of “yes” and “no”. “marital” seems to be a very weak predictor.
spineplot(x = bank$job, y = bank$marital, xlab = "marital", ylab = "job",
main = "Marital vs Job", col = c("lightblue", "coral", "grey"), xaxlabels = levels(bank$job))
Most students are single and most retired people are married, but the two variables are pretty mixed overall, so they don’t seem strongly associated.
spineplot(x = bank$education, y = bank$y, xlab = "education", ylab = "y",
main = "Education vs Y", col = c("lightblue", "coral"), xaxlabels = levels(bank$education))
The “yes” and “no” proportions are pretty even across categories. “education” may not be a good predictor.
spineplot(x = bank$default, y = bank$y, xlab = "default", ylab = "y",
main = "Default vs Y", col = c("lightblue", "coral"), xaxlabels = levels(bank$default))
The “yes”/“no” proportions are almost identical in both “default” categories, so “default” has almost no association with “y”.
par(las = 2, cex.axis = 0.75, mar = c(7, 4.1, 4.1, 2.1))
spineplot(x = bank$job, y = bank$default, xlab = "job", ylab = "default",
main = "Default vs Job", col = c("lightblue", "coral"), xaxlabels = levels(bank$job))
Almost all retired and student clients have no default, while every other job category contains some defaults. “default” has a very small number of “yes” values and doesn’t identify “y” well, so “default” can be deleted.
spineplot(x = bank$housing, y = bank$y, xlab = "housing", ylab = "y",
main = "Housing vs Y", col = c("lightblue", "coral"), xaxlabels = levels(bank$housing))
“housing” shows some difference in the proportion of “yes” and “no”, but not a very obvious one.
par(las = 2, cex.axis = 0.75, mar = c(7, 4.1, 4.1, 2.1))
spineplot(x = bank$job, y = bank$housing, xlab = "job", ylab = "housing",
main = "Job vs Housing", col = c("lightblue", "coral"), xaxlabels = levels(bank$job))
“housing” is spread widely across all jobs, so it seems to have a weak association with “job”.
ggplot(bank, aes(x=age, fill=housing, color=housing))+
geom_histogram(position="identity", alpha=0.5)+
stat_function(fun = function(bank, mean, sd, n){
n * dnorm(x = bank, mean = mean, sd = sd) },
args = with(bank, c(mean = mean(age), sd = sd(age), n
= length(age))), col="blue") +
scale_color_manual(values=c("lightblue", "coral"))+
scale_fill_manual(values=c("lightblue", "coral")) +
theme_classic() +
ggtitle("Age Distribution Colored by Housing") +
theme(plot.title = element_text(hjust = 0.5, size=14, face="bold"))
“housing” seems to be associated with “age”; the relationship between the two should be tested.
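One simple way to test this relationship (a sketch, not run in the original analysis) is a two-sample t-test of age between the two housing groups:

```r
# Welch two-sample t-test: does mean age differ by housing loan status?
t.test(age ~ housing, data = bank)
```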
spineplot(x = bank$balance, y = bank$housing, xlab = "balance", ylab = "housing", breaks=100,
main = "Balance vs Housing", col = c("lightblue", "coral"))
“housing” seems to be spread widely across balance values, suggesting a weak association.
spineplot(x = bank$loan, y = bank$y, xlab = "loan", ylab = "y",
main = "Loan vs Y", col = c("lightblue", "coral"), xaxlabels = levels(bank$loan))
No dramatic split.
spineplot(x = bank$housing, y = bank$loan, xlab = "housing", ylab = "loan",
main = "Housing vs Loan", col = c("lightblue", "coral"), xaxlabels = levels(bank$housing))
No strong association.
par(las = 2, cex.axis = 0.75, mar = c(7, 4.1, 4.1, 2.1))
spineplot(x = bank$contact, y = bank$y, xlab = "contact", ylab = "y",
main = "Contact vs Y", col = c("lightblue", "coral"), xaxlabels = levels(bank$contact))
May have some association.
In the above visualizations, some variables seem to be more associated with “y” than others. Here is a summary of the predictor variables:
Good: duration, job, age, poutcome; Medium: month, day, previous (bool); Weak: marital, education, balance, housing, loan, contact, campaign; Should not use: default, pdays
Run the tests to further identify the strength of the predictor variables.
The chi-squared test is for two categorical variables. Variables include: job, poutcome, month, previous (bool), education, housing, loan, contact, default
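The same test-plus-Cramér's-V boilerplate is repeated for each variable below; it could be wrapped in a small helper (a sketch; `assoc_with_y` is a hypothetical function name, and the `questionr` package is assumed for `cramer.v`):

```r
# Hypothetical helper: chi-squared test and Cramér's V for one
# categorical predictor against y, on a data frame like inde_test.
library(questionr)

assoc_with_y <- function(data, var) {
  tab  <- table(data[[var]], data$y)   # contingency table
  test <- chisq.test(tab)              # Pearson chi-squared test
  data.frame(variable  = var,
             x_squared = unname(test$statistic),
             p_value   = test$p.value,
             cramers_v = cramer.v(tab))
}

# Example:
# do.call(rbind, lapply(c("poutcome", "month", "education"),
#                       assoc_with_y, data = inde_test))
```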
Between predictor variables: job and default
Sample 1% of data for chi-squared test of independence.
set.seed(789)
inde_test <- bank[sample(seq_len(nrow(bank)), size = floor(0.01 * nrow(bank))), ]
nrow(inde_test)
[1] 822
Contingency tables of y ~ job, poutcome, month, day (need to convert to factor), previous (bool), education, housing, loan, contact and default
job
Chi-squared test: H0: The two variables are independent. H1: The two variables are related.
y_job <- table(inde_test$job, inde_test$y)
y_job
no yes
admin. 39 57
blue-collar 90 62
entrepreneur 12 13
housemaid 9 9
management 79 101
retired 20 39
self-employed 15 13
services 41 31
student 5 29
technician 49 76
unemployed 13 16
unknown 1 3
Pooling: When a variable has more than two categories, and some of them have small numbers, it often makes sense to pool some of the categories together.
Combine “student”, “housemaid”, “unknown”, “unemployed” and “entrepreneur” into a new category “Others”.
jobs <- levels(inde_test$job)
jobs
[1] "admin." "blue-collar" "entrepreneur" "housemaid" "management" "retired"
[7] "self-employed" "services" "student" "technician" "unemployed" "unknown"
library(rockchalk)
Attaching package: ‘rockchalk’
The following object is masked from ‘package:Hmisc’:
summarize
The following object is masked from ‘package:dplyr’:
summarize
inde_test$new_jobs <- combineLevels(inde_test$job, levs = c("student", "housemaid", "unknown",
"unemployed", "entrepreneur"),
newLabel = c("Others"))
The original levels admin. blue-collar entrepreneur housemaid management retired self-employed services student technician unemployed unknown
have been replaced by admin. blue-collar management retired self-employed services technician Others
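The same pooling can be done in base R without `rockchalk` (a sketch; assigning one label to several levels merges them into a single level):

```r
# Base-R alternative to combineLevels(): pool small job categories.
pool <- c("student", "housemaid", "unknown", "unemployed", "entrepreneur")
new_jobs <- inde_test$job
levels(new_jobs)[levels(new_jobs) %in% pool] <- "Others"
```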
y_new_job <- table(inde_test$new_jobs, inde_test$y)
y_new_job
no yes
admin. 39 57
blue-collar 90 62
management 79 101
retired 20 39
self-employed 15 13
services 41 31
technician 49 76
Others 40 70
chisq.test(y_new_job)
Pearson's Chi-squared test
data: y_new_job
X-squared = 26.082, df = 7, p-value = 0.0004869
Critical Value (use 95% confidence level). (If Chi Square value >= Critical Value, reject the null hypothesis. If Chi Square value < Critical Value, fail to reject the null hypothesis.)
df = (8-1) * (2-1)
qchisq(0.95, df)
[1] 14.06714
library(questionr)
Attaching package: ‘questionr’
The following objects are masked from ‘package:Hmisc’:
describe, wtd.mean, wtd.table, wtd.var
job_v <- cramer.v(y_new_job)
job_v
[1] 0.1781296
The p-value is very small and the chi-squared statistic is larger than the critical value, so we reject the null hypothesis: there is an association between “job” and “y”. The association is moderate based on the Cramér's V value.
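As a sanity check, Cramér's V can be recomputed by hand from the chi-squared statistic, V = sqrt(X² / (n · (min(r, c) − 1))) (a sketch; it should reproduce `job_v`):

```r
# Cramér's V by hand for the job-by-y table (8 x 2, n = 822 sampled rows).
x2 <- unname(chisq.test(y_new_job)$statistic)  # 26.082
n  <- sum(y_new_job)                           # total count, 822
k  <- min(dim(y_new_job)) - 1                  # min(8, 2) - 1 = 1
sqrt(x2 / (n * k))                             # ~0.178, matching cramer.v()
```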
poutcome
library(questionr)
print("Contingency Table of Poutcome and Y:")
[1] "Contingency Table of Poutcome and Y:"
po_y <- table(inde_test$poutcome, inde_test$y)
po_y
no yes
failure 37 66
other 11 26
success 5 76
unknown 320 281
print("Chi-Squared Test of Poutcome and Y Table:")
[1] "Chi-Squared Test of Poutcome and Y Table:"
chisq.test(po_y)
Pearson's Chi-squared test
data: po_y
X-squared = 72.605, df = 3, p-value = 0.000000000000001181
df <- (length(levels(inde_test$poutcome)) -1 ) * (length(levels(inde_test$y)) -1 )
print("Critical Value of Poutcome and Y Table:")
[1] "Critical Value of Poutcome and Y Table:"
qchisq(0.95, df)
[1] 7.814728
print("Cramer.V of Poutcome and Y:")
[1] "Cramer.V of Poutcome and Y:"
poutcome_v <- cramer.v(po_y)
poutcome_v
[1] 0.2971998
“poutcome” is associated with “y”. The association is fairly strong, with Cramér's V close to 0.3.
month
library(questionr)
print("Contingency Table of Month and Y:")
[1] "Contingency Table of Month and Y:"
month_y <- table(inde_test$month, inde_test$y)
month_y
no yes
apr 24 54
aug 60 55
dec 0 10
feb 27 45
jan 17 11
jul 43 53
jun 39 41
mar 6 17
may 113 86
nov 37 33
oct 3 22
sep 4 22
print("Chi-Squared Test of Month and Y Table:")
[1] "Chi-Squared Test of Month and Y Table:"
chisq.test(month_y)
Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: month_y
X-squared = 58.158, df = 11, p-value = 0.00000002035
df <- (length(levels(inde_test$month)) -1 ) * (length(levels(inde_test$y)) -1 )
print("Critical Value of Month and Y Table:")
[1] "Critical Value of Month and Y Table:"
qchisq(0.95, df)
[1] 19.67514
print("Cramer.V of Month and Y:")
[1] "Cramer.V of Month and Y:"
month_v <- cramer.v(month_y)
Chi-squared approximation may be incorrect
month_v
[1] 0.265992
“month” is associated with “y”, and the association is moderately strong.
previous (bool)
Change “previous” into a boolean.
inde_test$previous_bool <- inde_test$previous > 0
inde_test$previous_bool <- as.factor(inde_test$previous_bool)
library(questionr)
print("Contingency Table of Previous_bool and Y:")
[1] "Contingency Table of Previous_bool and Y:"
previous_bool_y <- table(inde_test$previous_bool, inde_test$y)
previous_bool_y
no yes
FALSE 320 280
TRUE 53 169
print("Chi-Squared Test of Previous_bool and Y Table:")
[1] "Chi-Squared Test of Previous_bool and Y Table:"
chisq.test(previous_bool_y)
Pearson's Chi-squared test with Yates' continuity correction
data: previous_bool_y
X-squared = 55.555, df = 1, p-value = 0.00000000000009087
df <- (length(levels(inde_test$previous_bool)) -1 ) * (length(levels(inde_test$y)) -1 )
print("Critical Value of Previous_bool and Y Table:")
[1] "Critical Value of Previous_bool and Y Table:"
qchisq(0.95, df)
[1] 3.841459
print("Cramer.V of Previous_bool and Y:")
[1] "Cramer.V of Previous_bool and Y:"
previous_bool_v <- cramer.v(previous_bool_y)
previous_bool_v
[1] 0.2627237
“previous_bool” and “y” are associated, and the association strength is moderate.
education
library(questionr)
print("Contingency Table of Education and Y:")
[1] "Contingency Table of Education and Y:"
education_y <- table(inde_test$education, inde_test$y)
education_y
no yes
primary 67 52
secondary 183 205
tertiary 107 167
unknown 16 25
print("Chi-Squared Test of Education and Y Table:")
[1] "Chi-Squared Test of Education and Y Table:"
chisq.test(education_y)
Pearson's Chi-squared test
data: education_y
X-squared = 11.322, df = 3, p-value = 0.0101
df <- (length(levels(inde_test$education)) -1 ) * (length(levels(inde_test$y)) -1 )
print("Critical Value of Education and Y Table:")
[1] "Critical Value of Education and Y Table:"
qchisq(0.95, df)
[1] 7.814728
print("Cramer.V of Education and Y:")
[1] "Cramer.V of Education and Y:"
education_v <- cramer.v(education_y)
education_v
[1] 0.1173641
“education” and “y” are associated, but the association is weak.
housing
library(questionr)
print("Contingency Table of Housing and Y:")
[1] "Contingency Table of Housing and Y:"
housing_y <- table(inde_test$housing, inde_test$y)
housing_y
no yes
no 159 274
yes 214 175
print("Chi-Squared Test of Housing and Y Table:")
[1] "Chi-Squared Test of Housing and Y Table:"
chisq.test(housing_y)
Pearson's Chi-squared test with Yates' continuity correction
data: housing_y
X-squared = 26.929, df = 1, p-value = 0.000000211
df <- (length(levels(inde_test$housing)) -1 ) * (length(levels(inde_test$y)) -1 )
print("Critical Value of Housing and Y Table:")
[1] "Critical Value of Housing and Y Table:"
qchisq(0.95, df)
[1] 3.841459
print("Cramer.V of Housing and Y:")
[1] "Cramer.V of Housing and Y:"
housing_v <- cramer.v(housing_y)
housing_v
[1] 0.1834465
“housing” and “y” are associated, and the association strength is moderate.
library(questionr)
print("Contingency Table of Loan and Y:")
[1] "Contingency Table of Loan and Y:"
loan_y <- table(inde_test$loan, inde_test$y)
loan_y
no yes
no 307 400
yes 66 49
print("Chi-Squared Test of Loan and Y Table:")
[1] "Chi-Squared Test of Loan and Y Table:"
chisq.test(loan_y)
Pearson's Chi-squared test with Yates' continuity correction
data: loan_y
X-squared = 7.2329, df = 1, p-value = 0.007158
df <- (length(levels(inde_test$loan)) -1 ) * (length(levels(inde_test$y)) -1 )
print("Critical Value of Loan and Y Table:")
[1] "Critical Value of Loan and Y Table:"
qchisq(0.95, df)
[1] 3.841459
print("Cramer.V of Loan and Y:")
[1] "Cramer.V of Loan and Y:"
loan_v <- cramer.v(loan_y)
loan_v
[1] 0.09732567
“loan” and “y” are weakly associated.
contact
library(questionr)
print("Contingency Table of Contact and Y:")
[1] "Contingency Table of Contact and Y:"
contact_y <- table(inde_test$contact, inde_test$y)
contact_y
no yes
cellular 248 366
telephone 22 39
unknown 103 44
print("Chi-Squared Test of Contact and Y Table:")
[1] "Chi-Squared Test of Contact and Y Table:"
chisq.test(contact_y)
Pearson's Chi-squared test
data: contact_y
X-squared = 44.449, df = 2, p-value = 0.0000000002229
df <- (length(levels(inde_test$contact)) -1 ) * (length(levels(inde_test$y)) -1 )
print("Critical Value of Contact and Y Table:")
[1] "Critical Value of Contact and Y Table:"
qchisq(0.95, df)
[1] 5.991465
print("Cramer.V of Contact and Y:")
[1] "Cramer.V of Contact and Y:"
contact_v <- cramer.v(contact_y)
contact_v
[1] 0.2325378
“contact” and “y” are associated, with a moderate Cramér's V.
default
library(questionr)
print("Contingency Table of Default and Y:")
[1] "Contingency Table of Default and Y:"
default_y <- table(inde_test$default, inde_test$y)
default_y
no yes
no 367 446
yes 6 3
print("Chi-Squared Test of Contact and Y Table:")
[1] "Chi-Squared Test of Contact and Y Table:"
chisq.test(default_y)
Chi-squared approximation may be incorrect
Pearson's Chi-squared test with Yates' continuity correction
data: default_y
X-squared = 0.90884, df = 1, p-value = 0.3404
df <- (length(levels(inde_test$default)) -1 ) * (length(levels(inde_test$y)) -1 )
print("Critical Value of Default and Y Table:")
[1] "Critical Value of Default and Y Table:"
qchisq(0.95, df)
[1] 3.841459
print("Cramer.V of Default and Y:")
[1] "Cramer.V of Default and Y:"
default_v <- cramer.v(default_y)
Chi-squared approximation may be incorrect
default_v
[1] 0.04499212
The p-value is larger than 0.05 and the chi-squared statistic is smaller than the critical value, so we fail to reject the null hypothesis: we cannot conclude that “default” and “y” are associated.
marital
library(questionr)
print("Contingency Table of Marital and Y:")
[1] "Contingency Table of Marital and Y:"
marital_y <- table(inde_test$marital, inde_test$y)
marital_y
no yes
divorced 41 52
married 226 234
single 106 163
print("Chi-Squared Test of Marital and Y Table:")
[1] "Chi-Squared Test of Marital and Y Table:"
chisq.test(marital_y)
Pearson's Chi-squared test
data: marital_y
X-squared = 6.5475, df = 2, p-value = 0.03786
df <- (length(levels(inde_test$marital)) -1 ) * (length(levels(inde_test$y)) -1 )
print("Critical Value of Marital and Y Table:")
[1] "Critical Value of Marital and Y Table:"
qchisq(0.95, df)
[1] 5.991465
print("Cramer.V of Marital and Y:")
[1] "Cramer.V of Marital and Y:"
marital_v <- cramer.v(marital_y)
marital_v
[1] 0.08924855
The test is marginally significant (p ≈ 0.038), but Cramér's V is below 0.1, so any association between “marital” and “y” is very weak.
job and default
library(questionr)
print("Contingency Table of Default and Job:")
[1] "Contingency Table of Default and Job:"
default_job_new <- table(inde_test$default, inde_test$new_jobs)
default_job_new
admin. blue-collar management retired self-employed services technician Others
no 96 151 178 59 27 72 124 106
yes 0 1 2 0 1 0 1 4
print("Chi-Squared Test of Contact and Job Table:")
[1] "Chi-Squared Test of Contact and Job Table:"
chisq.test(default_job_new)
Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: default_job_new
X-squared = 11.029, df = 7, p-value = 0.1374
df <- (length(levels(inde_test$default)) -1 ) * (length(levels(inde_test$new_jobs)) -1 )
print("Critical Value of Default and Job Table:")
[1] "Critical Value of Default and Job Table:"
qchisq(0.95, df)
[1] 14.06714
print("Cramer.V of Default and Job:")
[1] "Cramer.V of Default and Job:"
default_job_new_v <- cramer.v(default_job_new)
Chi-squared approximation may be incorrect
default_job_new_v
[1] 0.1158313
No significant association between “job” and “default” (p = 0.137).
Visualize the results
var <- c("job", "poutcome", "month", "previous_bool", "education", "housing", "loan",
"contact", "marital", "default")
v <- c(job_v, poutcome_v, month_v, previous_bool_v, education_v, housing_v, loan_v, contact_v, marital_v,
default_v)
chi_test <- data.frame(var, v)
names(chi_test) <- c("Variables", "Cramer.V")
chi_test
library(ggplot2)
ggplot(data=chi_test, aes(x=reorder(Variables, -Cramer.V), y=Cramer.V, fill=Variables)) +
geom_bar(stat="identity")+
theme_classic() + ylab("Cramer's V") + xlab("Categorical Variables") +
ggtitle("Cramer's V of Categorical Variables") +
theme(plot.title = element_text(hjust = 0.5, size=14, face="bold"),
axis.text.x = element_text(angle = 45, hjust = 1, face="bold", size=14))
Logistic regression is used for a continuous predictor and a categorical response. Variables include: duration, age, balance, campaign, pdays, day
Between predictor variables: age and housing
duration
lr_duration <- glm(y ~ duration, data = inde_test, family = "binomial")
summary(lr_duration)$coefficient
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.251107855 0.1380794760 -9.060781 0.0000000000000000001295201367685781
duration 0.004172182 0.0003812472 10.943509 0.0000000000000000000000000007138219
age
lr_age <- glm(y ~ duration + age, data = inde_test, family = "binomial")
summary(lr_age)$coefficient
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.370238978 0.3006851422 -4.5570558 0.0000051875640742799672725741461088
duration 0.004176776 0.0003817469 10.9412201 0.0000000000000000000000000007320799
age 0.002871328 0.0064213670 0.4471521 0.6547652432307282666101855284068733
lr <- glm(y ~ duration + age + balance + campaign + pdays + day, data = inde_test, family = "binomial")
summary(lr)
Call:
glm(formula = y ~ duration + age + balance + campaign + pdays +
day, family = "binomial", data = inde_test)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.7719 -0.8390 0.2247 0.8815 2.1955
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.13567078 0.35562163 -3.193 0.00141 **
duration 0.00433787 0.00039481 10.987 < 0.0000000000000002 ***
age -0.00032327 0.00675359 -0.048 0.96182
balance 0.00009650 0.00003147 3.067 0.00216 **
campaign -0.14777000 0.04141840 -3.568 0.00036 ***
pdays 0.00445010 0.00082303 5.407 0.0000000641 ***
day -0.01368338 0.00994092 -1.376 0.16868
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1132.5 on 821 degrees of freedom
Residual deviance: 866.1 on 815 degrees of freedom
AIC: 880.1
Number of Fisher Scoring iterations: 5
library(caret)
imp <- as.data.frame(varImp(lr))
imp <- data.frame(overall = imp$Overall,
names = rownames(imp))
lr_test <- imp[order(imp$overall,decreasing = T),]
names(lr_test) <- c("Importance", "Variables")
lr_test
One-Way ANOVA Test for Age and Y
“age” is approximately normally distributed, so a one-way ANOVA test can be applied.
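The normality assumption can be checked informally before running the test (a sketch, not part of the original analysis; note `shapiro.test` accepts at most 5000 values, so a random subsample is used):

```r
# Informal normality check for age.
qqnorm(bank$age, main = "Normal Q-Q Plot of Age")
qqline(bank$age, col = "coral")
set.seed(42)
shapiro.test(sample(bank$age, 5000))  # formal test on a random subsample
```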
oneway.test(age~ y, bank, var.equal=T)
One-way analysis of means
data: age and y
F = 98.55, num df = 1, denom df = 82232, p-value < 0.00000000000000022
Critical value
qf(0.95, 1, 82232)
[1] 3.841572
The F value exceeds the critical value and the p-value < 0.05, so we reject the null hypothesis: there is an association between “age” and “y”.
Visualize the test results
library(ggplot2)
ggplot(data=lr_test, aes(x=reorder(Variables, -Importance), y=Importance, fill=Variables)) +
geom_bar(stat="identity")+
theme_classic() + ylab("Importance") + xlab("Continuous Variables") +
ggtitle("Importance of Continuous Variables") +
theme(plot.title = element_text(hjust = 0.5, size=14, face="bold"), axis.text.x = element_text(
face="bold", angle = 45, hjust = 1, size=14))
Feature Selection
feature_group1: month, poutcome, contact, duration
feature_group2: month, poutcome, contact, duration, job, housing, previous_bool
bank_base <- bank
The levels are ordered alphabetically, so “no” is the positive class by default. Reorder the levels so that “yes” becomes the positive class in the model.
bank_base$y = factor(bank_base$y,levels(bank_base$y)[c(2,1)])
levels(bank_base$y)
[1] "yes" "no"
Now “yes” is the positive class in the models.
Two baseline classification models are created, a decision tree and a random forest, using all 16 predictor variables.
## 75% of the sample size
size_base <- floor(0.75 * nrow(bank_base))
## set the seed to make your partition reproducible
set.seed(123)
# sample(seq_len(nrow(bank_base)), size = size_base) -> the index of the training set
train_ind <- sample(seq_len(nrow(bank_base)), size = size_base)
train <- bank_base[train_ind, ]
test <- bank_base[-train_ind, ]
Create a decision tree
library(rpart)
base_dt <- rpart(y ~ ., train)
printcp(base_dt)
Classification tree:
rpart(formula = y ~ ., data = train)
Variables actually used in tree construction:
[1] contact duration housing month poutcome
Root node error: 29962/61675 = 0.4858
n= 61675
CP nsplit rel error xerror xstd
1 0.428977 0 1.00000 1.00000 0.0041427
2 0.044056 1 0.57102 0.57156 0.0037120
3 0.022962 3 0.48291 0.48308 0.0035127
4 0.022128 4 0.45995 0.47176 0.0034838
5 0.019992 5 0.43782 0.43836 0.0033933
6 0.017088 6 0.41783 0.41846 0.0033357
7 0.010000 9 0.36656 0.36706 0.0031728
plotcp(base_dt)
#optimal cp for baseline tree
opt_index <- which.min(base_dt$cptable[, "xerror"])
cp_opt <- base_dt$cptable[opt_index, "CP"]
cp_opt
[1] 0.01
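Since the optimal cp equals the rpart default of 0.01, pruning would leave the tree unchanged here; the general pattern is (a sketch):

```r
# Prune the baseline tree at the cross-validation-optimal cp.
library(rpart)
base_dt_pruned <- prune(base_dt, cp = cp_opt)  # identical when cp_opt == 0.01
```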
library(rpart.plot)
rpart.plot(base_dt)
Make prediction on the test set
test$BaseDT_Predict <- predict(base_dt, test[, 1:16], type = "class")
Compare the true y and the predicted y
head(test[, 17:18])
Build models with the features selected. Feature group 1: month + poutcome + contact + duration; Feature group 2: month + poutcome + contact + duration + job + housing + previous_bool
Make a copy of bank_base. Replace the original “previous” (numeric) column with a boolean version.
bank_exp <- bank_base
# replace the original previous (numeric) with a boolean
bank_exp$previous <- as.factor(bank_exp$previous>0)
head(bank_exp)
For the models using feature group 1, the training and test sets are the same as for the baseline model. For the models using feature group 2, use the training and test sets below:
## 75% of the sample size
size_exp <- floor(0.75 * nrow(bank_exp))
## set the seed to make your partition reproducible
set.seed(123)
# sample(seq_len(nrow(bank_exp)), size = size_exp) -> the index of the training set
train_ind_exp <- sample(seq_len(nrow(bank_exp)), size = size_exp)
train_exp <- bank_exp[train_ind_exp, ]
test_exp <- bank_exp[-train_ind_exp, ]
The training set and test set are the same as in the baseline models because the seed is the same.
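This can be verified directly (a sketch; both calls to `sample` follow the same `set.seed(123)`, so the drawn row indices match):

```r
# The two training sets should contain exactly the same rows.
identical(rownames(train), rownames(train_exp))  # expected: TRUE
```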
Decision Tree
cp defaults to 0.01
library(rpart)
# decision tree with feature group 1
tree1 <- rpart(y ~ month + poutcome + contact + duration, data = train)
# decision tree with feature group 2, the "previous" is bool, but with the same name
tree2 <- rpart(y ~ month + poutcome + contact + duration + job +
housing + previous, data = train_exp)
get the optimal cp based on cross-validation error
printcp(tree1)
Classification tree:
rpart(formula = y ~ month + poutcome + contact + duration, data = train)
Variables actually used in tree construction:
[1] contact duration month poutcome
Root node error: 29962/61675 = 0.4858
n= 61675
CP nsplit rel error xerror xstd
1 0.428977 0 1.00000 1.00000 0.0041427
2 0.044056 1 0.57102 0.57156 0.0037120
3 0.022962 3 0.48291 0.48308 0.0035127
4 0.022128 4 0.45995 0.47176 0.0034838
5 0.019992 5 0.43782 0.43836 0.0033933
6 0.010000 6 0.41783 0.41846 0.0033357
plotcp(tree1)
printcp(tree2)
Classification tree:
rpart(formula = y ~ month + poutcome + contact + duration + job +
housing + previous, data = train_exp)
Variables actually used in tree construction:
[1] contact duration housing month poutcome
Root node error: 29962/61675 = 0.4858
n= 61675
CP nsplit rel error xerror xstd
1 0.428977 0 1.00000 1.00000 0.0041427
2 0.044056 1 0.57102 0.57152 0.0037120
3 0.022962 3 0.48291 0.48321 0.0035131
4 0.022128 4 0.45995 0.46796 0.0034739
5 0.019992 5 0.43782 0.43839 0.0033934
6 0.017088 6 0.41783 0.41840 0.0033355
7 0.010000 9 0.36656 0.36686 0.0031721
tree2 appears to be identical to the baseline tree.
plotcp(tree2)
#optimal cp for tree1
opt_index1 <- which.min(tree1$cptable[, "xerror"])
cp_opt1 <- tree1$cptable[opt_index1, "CP"]
#optimal cp for tree2
opt_index2 <- which.min(tree2$cptable[, "xerror"])
cp_opt2 <- tree2$cptable[opt_index2, "CP"]
cp_opt1
[1] 0.01
cp_opt2
[1] 0.01
library(rpart.plot)
rpart.plot(tree1)
rpart.plot(tree2)
Not all the input variables are used.
Make prediction on the test set
test$Tree1_Predict <- predict(tree1, test[, 1:16], type = "class")
test_exp$Tree2_Predict <- predict(tree2, test_exp[, 1:16], type = "class")
Compare the true y and the predicted y
head(test[, c(17, 19)])
head(test_exp[, c(17, 18)])
Random Forest
library(randomForest)
# random forest with feature group 1
forest1<- randomForest(y ~ month + poutcome + contact + duration, data = train)
library(randomForest)
# random forest with feature group 2
forest2<- randomForest(y ~ month + poutcome + contact + duration + job +
housing + previous, data = train_exp)
runs long!
Make prediction on test set
library(randomForest)
test$Forest1_Predict <- predict(forest1, test[, 1:16])
test_exp$Forest2_Predict <- predict(forest2, test_exp[, 1:16])
baseline decision tree
library(caret)
cm_basedt <- confusionMatrix(test$BaseDT_Predict, test$y)
cm_basedt
Confusion Matrix and Statistics
Reference
Prediction yes no
yes 9371 2479
no 1156 7553
Accuracy : 0.8232
95% CI : (0.8179, 0.8284)
No Information Rate : 0.512
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.6451
Mcnemar's Test P-Value : < 0.00000000000000022
Sensitivity : 0.8902
Specificity : 0.7529
Pos Pred Value : 0.7908
Neg Pred Value : 0.8673
Prevalence : 0.5120
Detection Rate : 0.4558
Detection Prevalence : 0.5764
Balanced Accuracy : 0.8215
'Positive' Class : yes
experimental decision tree with feature group 1
library(caret)
cm_tree1<-confusionMatrix(test$Tree1_Predict, test$y)
cm_tree1
Confusion Matrix and Statistics
Reference
Prediction yes no
yes 9842 3534
no 685 6498
Accuracy : 0.7948
95% CI : (0.7892, 0.8003)
No Information Rate : 0.512
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.5866
Mcnemar's Test P-Value : < 0.00000000000000022
Sensitivity : 0.9349
Specificity : 0.6477
Pos Pred Value : 0.7358
Neg Pred Value : 0.9046
Prevalence : 0.5120
Detection Rate : 0.4787
Detection Prevalence : 0.6506
Balanced Accuracy : 0.7913
'Positive' Class : yes
experimental decision tree with feature group 2
library(caret)
cm_tree2<-confusionMatrix(test_exp$Tree2_Predict, test_exp$y)
cm_tree2
Confusion Matrix and Statistics
Reference
Prediction yes no
yes 9371 2479
no 1156 7553
Accuracy : 0.8232
95% CI : (0.8179, 0.8284)
No Information Rate : 0.512
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.6451
Mcnemar's Test P-Value : < 0.00000000000000022
Sensitivity : 0.8902
Specificity : 0.7529
Pos Pred Value : 0.7908
Neg Pred Value : 0.8673
Prevalence : 0.5120
Detection Rate : 0.4558
Detection Prevalence : 0.5764
Balanced Accuracy : 0.8215
'Positive' Class : yes
experimental random forest with feature group 1
library(caret)
cm_forest1<-confusionMatrix(test$Forest1_Predict, test$y)
cm_forest1
Confusion Matrix and Statistics
Reference
Prediction yes no
yes 9593 1758
no 934 8274
Accuracy : 0.8691
95% CI : (0.8644, 0.8736)
No Information Rate : 0.512
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.7375
Mcnemar's Test P-Value : < 0.00000000000000022
Sensitivity : 0.9113
Specificity : 0.8248
Pos Pred Value : 0.8451
Neg Pred Value : 0.8986
Prevalence : 0.5120
Detection Rate : 0.4666
Detection Prevalence : 0.5521
Balanced Accuracy : 0.8680
'Positive' Class : yes
experimental random forest with feature group 2
library(caret)
cm_forest2<-confusionMatrix(test_exp$Forest2_Predict, test_exp$y)
cm_forest2
Confusion Matrix and Statistics
Reference
Prediction yes no
yes 9748 1584
no 779 8448
Accuracy : 0.8851
95% CI : (0.8806, 0.8894)
No Information Rate : 0.512
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.7696
Mcnemar's Test P-Value : < 0.00000000000000022
Sensitivity : 0.9260
Specificity : 0.8421
Pos Pred Value : 0.8602
Neg Pred Value : 0.9156
Prevalence : 0.5120
Detection Rate : 0.4741
Detection Prevalence : 0.5512
Balanced Accuracy : 0.8841
'Positive' Class : yes
forest1$importance
MeanDecreaseGini
month 4084.021
poutcome 2612.859
contact 1553.985
duration 11575.615
forest2$importance
MeanDecreaseGini
month 3097.7082
poutcome 1755.3077
contact 1226.9245
duration 9517.5597
job 939.8455
housing 821.8555
previous 608.1455
ROC Curve
library('pROC')
Type 'citation("pROC")' for a citation.
Attaching package: ‘pROC’
The following objects are masked from ‘package:stats’:
cov, smooth, var
plot(roc(test$y, predict(base_dt, test[, 1:16], type = "prob")[, 2]), col = "darkgoldenrod1",
legacy.axes = T, xlab = "False Positive Rate (1-Specificity)",
ylab = "True Positive Rate (Sensitivity)", main = "Compare Models")
plot(roc(test$y, predict(tree1, test[,1:16], type = "prob")[,2]), add=T, col="aquamarine")
plot(roc(test_exp$y, predict(tree2, test_exp[,1:16], type = "prob")[,2]), add=T, col="coral")
plot(roc(test$y, predict(forest1, test[,1:16], type = "prob")[,2]), add=T, col="chartreuse1")
plot(roc(test_exp$y, predict(forest2, test_exp[,1:16], type = "prob")[,2]), add=T, col="purple")
legend("bottomright", legend=c("Baseline Tree", "Tree1", "Tree2", "Forest1", "Forest2"),col=c("darkgoldenrod1", "aquamarine", "coral", "chartreuse1", "purple"), lwd=2)
Random forest performs better than decision tree, and feature group 2 performs better than feature group 1. The decision tree with feature group 2 is exactly the same as the baseline decision tree. The decision tree with feature group 1 is slightly worse than the baseline.
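The visual comparison can be quantified with the area under each ROC curve (a sketch using `pROC::auc`; these values were not reported in the original output):

```r
# AUC for each model on its test set.
library(pROC)
auc(roc(test$y, predict(base_dt, test[, 1:16], type = "prob")[, 2]))
auc(roc(test$y, predict(tree1, test[, 1:16], type = "prob")[, 2]))
auc(roc(test_exp$y, predict(tree2, test_exp[, 1:16], type = "prob")[, 2]))
auc(roc(test$y, predict(forest1, test[, 1:16], type = "prob")[, 2]))
auc(roc(test_exp$y, predict(forest2, test_exp[, 1:16], type = "prob")[, 2]))
```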
cm <- c(cm_basedt$byClass[1], cm_tree1$byClass[1], cm_tree2$byClass[1], cm_forest1$byClass[1],
cm_forest2$byClass[1])
cm_df <- data.frame(cm)
cm_df$Model <- c("Baseline Tree", "Tree1", "Tree2", "Forest1", "Forest2")
colnames(cm_df)<- c("Sensitivity", "Model")
cm_df <- cm_df[, c(2,1)]
cm_df
library(ggplot2)
ggplot(data=cm_df, aes(x=reorder(Model, -Sensitivity), y=Sensitivity, fill=Model)) +
geom_bar(stat="identity")+
theme_classic() + ylab("Sensitivity") + xlab("Model") +
ggtitle("Compare Model Sensitivity") +
theme(plot.title = element_text(hjust = 0.5, size=14, face="bold"),
axis.text.x = element_text(angle = 45, hjust = 1, face="bold", size=14))
ADD-ON: Create a baseline random forest
This will not go into the report.
The default is ntree = 500.
library(randomForest)
base_rf1<- randomForest(y ~ ., data = train)
Make prediction on the test set
test$BaseRF1_Predict <- predict(base_rf1, test[, 1:16])
Compare the true y and the predicted y
head(test[, c("y", "BaseRF1_Predict")])
ntree = 500
library(caret)
cm_baserf <- confusionMatrix(test$BaseRF1_Predict, test$y)
cm_baserf
Confusion Matrix and Statistics
Reference
Prediction yes no
yes 10527 152
no 0 9880
Accuracy : 0.9926
95% CI : (0.9913, 0.9937)
No Information Rate : 0.512
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.9852
Mcnemar's Test P-Value : < 0.00000000000000022
Sensitivity : 1.0000
Specificity : 0.9848
Pos Pred Value : 0.9858
Neg Pred Value : 1.0000
Prevalence : 0.5120
Detection Rate : 0.5120
Detection Prevalence : 0.5194
Balanced Accuracy : 0.9924
'Positive' Class : yes